Skip to content

[newchem-cpp] Improve implementation of FrozenStringIdxBiMap#484

Open
mabruzzo wants to merge 32 commits intograckle-project:newchem-cppfrom
mabruzzo:ncc/better-FrozenKeyIdxBiMap
Open

[newchem-cpp] Improve implementation of FrozenStringIdxBiMap#484
mabruzzo wants to merge 32 commits intograckle-project:newchem-cppfrom
mabruzzo:ncc/better-FrozenKeyIdxBiMap

Conversation

@mabruzzo
Copy link
Collaborator

@mabruzzo mabruzzo commented Jan 6, 2026

This should be reviewed after #462 has been merged


Background Context

Recall that PR #451 introduced the FrozenStringIdxBiMap. It is a datatype that implements a bidirectional map (aka a bidirectional dictionary), that can be used to map between a unique set of n strings (keys) and a unique set of indexes (with values of 0 through n-1) and vice-versa. In PR #451 I intentionally adopted an overly-simplistic implementation for the datatype (i.e. it just did linear searches over the full array) to make PR easier to review.

Description

This PR totally changes the underlying implementation to improve efficiency. These changes have NO IMPACT on the interface. Under the hood, FrozenStringIdxBiMap is now implemented in terms of a Hash Table.

On a historical note, this PR effectively supersedes #270 (which originally proposed the same implementation)

Tradeoffs:

Benefit

This is clear-cut: average time complexity to search for a key goes from Θ(n) (linear) to Θ(1) (constant)

Cost

The storage allocated from the heap is larger. Suppose s is the average number of bytes needed to track a string allocation and n keys in a bimap:

  • the old approach only allocated n*(8 + s) bytes (we just allocate n pointers)
  • the new approach requires n*(2+16*k + s) bytes
    • k * n * 16 bytes for each row of the hash table and 2*n bytes (aka n std::uint16_t values) for rapid reverse lookup
    • k is the ratio between capacity & number of keys (a larger k reduces collisions). Currently, it's 2 (but we can reduce it!)

Let's throw out some concrete numbers. Suppose we have 100 items (frankly, that seems a pretty high) and the average length is ~15 bytes

Old Approach New Approach
reference string literals (i.e. s=0) 800 bytes 3400 bytes
copy each string (i.e. s=15) 2300 bytes 4900 bytes

The fact that we are talking about less than 5 killobytes makes me think this cost isn't terribly significant (and, if it's a concern, we could try reducing k from 2 to 1.5)

  • for 100 items (which seems a little high) where we just reference memory in string literals (s=0) we are talking about 800 bytes vs 3400 bytes
  • for 100 items, where we take ownership of the strings and strings have an average length of ~15 bytes we are talking about 2300 bytes vs 4900 bytes

In terms of absolute numbers, 5kB clearly isn't a concern (a linux page size is 4 kB). And if we ever refactor enough and use this in all the places I think it could be useful (e.g. using it to provide map-access to chemistry parameters or grackle fields, and for one or two cases to help with setup), we are talking about maybe creating 5 instances while initializing grackle (and most would promptly get de-allocated)

Other thoughts

Since this just slots into the interface defined by #451, there isn't a ton of urgency to this PR

There is some unforunate consequences, but if we don't do this you end
up with some pretty confusing looking code... (the confusion primarily
arises if new_FrozenKeyIdxBiMap or drop_FrozenKeyIdxBiMap has different
behavior from other functions with similar names). I really think it
would be better to make this into a simple C++ class with a constructor
and destructor, but thats a topic for another time
One minor change involved formally making it illegal to construct a
GrainSpeciesInfo instance that holds 0 dust species. That's perfectly
fine. `GrainSpeciesInfo` holds some redundant info that we will remove
in the a subsequent commit
Modify tests to start using `FrozenKeyIdxBiMap` rather than
`GrainSpeciesInfoEntry::onlygrainsp_idx` and
`GrainSpeciesInfoEntry::name`. In the next commit, we'll remove these
members from `GrainSpeciesInfoEntry`
Delete `GrainSpeciesInfoEntry::onlygrainsp_idx` and
`GrainSpeciesInfoEntry::name`.
Also made a few assorted fixes
Both changes essentially affected 2 constants, which prior to this
commit were known as `keylen_max` & `invalid_val`.

The 2 contained constants are really just implementation details that
users of the inferface for `FrozenKeyIdxBiMap` shouldn't need to know
anything about
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

Status: No status

Development

Successfully merging this pull request may close these issues.

1 participant